Associative memory of structured knowledge

A long-standing challenge in biological and artificial intelligence is to understand how new knowledge can be constructed from known building blocks in a way that is amenable to computation by neuronal circuits. Here we focus on the task of storage and recall of structured knowledge in long-term memory. Specifically, we ask how recurrent neuronal networks can store and retrieve multiple knowledge structures. We model each structure as a set of binary relations between events and attributes (attributes may represent, e.g., temporal order, spatial location, or role in a semantic structure), and map each structure to a distributed neuronal activity pattern using a vector symbolic architecture scheme. We then use associative memory plasticity rules to store the binarized patterns as fixed points in a recurrent network. By a combination of signal-to-noise analysis and numerical simulations, we demonstrate that our model allows for efficient storage of these knowledge structures, such that the memorized structures as well as their individual building blocks (e.g., events and attributes) can be subsequently retrieved from partial retrieval cues. We show that long-term memory of structured knowledge relies on a new principle of computation beyond the memory basins. Finally, we show that our model can be extended to store sequences of memories as single attractors.


Overview
In holographic reduced representation (HRR)$^1$, the binding operation $g$ is circular convolution, whose individual components are given by
$$g_k(a, b) = \sum_{j=0}^{N-1} a_j\, b_{k-j},$$
with all subscripts defined modulo $N$. For objects and attributes $a, b \in \mathbb{R}^N$, this creates a representation $g(a, b) \in \mathbb{R}^N$. The circular convolution operation is both associative and commutative, i.e., $\mathrm{cconv}(a, b) = \mathrm{cconv}(b, a)$.
The corresponding decoding operation is circular correlation, whose individual components are given by
$$f_k(a, b) = \sum_{j=0}^{N-1} a_j\, b_{k+j}.$$
In general, the circular correlation operation is neither associative nor commutative. However, it can be related to circular convolution by defining the involution of a vector $a$ as $a^*$, where $a^*_j = a_{-j}$. Then $\mathrm{ccorr}(a, b) = \mathrm{cconv}(a^*, b)$, and $a^*$ serves as an approximate inverse to $a$ when both $a$ and $b$ are random vectors.
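As a quick illustration of these operations, the following is a minimal sketch (assuming, as in the analysis below, random vectors with iid $\mathcal{N}(0, \frac{1}{N})$ components; all function names are ours, not from a library) implementing binding and unbinding via the FFT:

```python
import numpy as np

def cconv(a, b):
    # Circular convolution: g_k = sum_j a_j b_{k-j}, subscripts mod N.
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(a, b):
    # Circular correlation: f_k = sum_j a_j b_{k+j} = cconv(a*, b),
    # where the involution is a*_j = a_{-j}.
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

N = 1024
rng = np.random.default_rng(0)
a, b = rng.normal(0, 1 / np.sqrt(N), size=(2, N))

g = cconv(a, b)          # bind object a to attribute b
a_hat = ccorr(b, g)      # unbind with b: a_hat ~ a plus crosstalk noise
print(a @ a_hat)         # close to |a|^2 ~ 1
```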

Clean-up and SNR
To quantify the effect of the ML clean-up operation described in the main text, we compute various statistics of the estimator $\hat{a}$ for the corresponding object $a$ decoded from a relational structure $S$ using the HRR unbinding operation. We consider the case where all objects and attributes are random vectors with components drawn iid, i.e., $a_{i,d} \sim \mathcal{N}(0, \frac{1}{N})$ and $b_{i,d} \sim \mathcal{N}(0, \frac{1}{N})$, and decoding is done from a dictionary of size $D$.
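The clean-up step itself can be sketched as follows (a hedged illustration: we assume the clean-up simply returns the dictionary item with maximal overlap with the noisy estimator, consistent with ML decoding as described in the main text):

```python
import numpy as np

def ccorr(a, b):
    # Circular correlation used for HRR unbinding (see Overview).
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

def ml_cleanup(S, b_cue, dictionary):
    # Unbind the attribute cue from the structure S, then replace the
    # noisy estimator by the most correlated dictionary item (shape (D, N)).
    a_hat = ccorr(b_cue, S)
    return np.argmax(dictionary @ a_hat)   # index of the decoded object
```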
Since the encoding operation is permutation invariant, the statistics of all $\hat{a}_\ell$ decoded from the full structure should be the same. For simplicity we set $\ell = 1$ and consider $\hat{a}_1$. We index dictionary items so that indices $\ell \leq L$ correspond to the dictionary items contained in the structure $S$ and indices $L < \ell \leq D$ correspond to dictionary items that are not contained in $S$. Note that
$$\langle a_{i,\ell}\, a_{j,\ell'} \rangle = \frac{1}{N}\, \delta_{ij}\, \delta_{\ell\ell'} \quad (5)$$
$$\langle a_{i,\ell}\, a_{j,\ell'}\, a_{k,s}\, a_{l,s'} \rangle = \frac{1}{N^2}\left(\delta_{ij}\delta_{\ell\ell'}\delta_{kl}\delta_{ss'} + \delta_{ik}\delta_{\ell s}\delta_{jl}\delta_{\ell' s'} + \delta_{il}\delta_{\ell s'}\delta_{jk}\delta_{\ell' s}\right) \quad (6)$$
Using the expression for the unbinding operation in Eqn. 4 together with Eqn. 2, we can express the components of the estimator in terms of the $a$'s and $b$'s as
$$\hat{a}_{j,1} = a_{j,1}\,(1 + \varepsilon) + \xi_{j,1} + \xi_{j,2},$$
where $\varepsilon$ is noise due to deviations in the normalization of the cue, $\xi_{j,1}$ is noise from interference between different elements of the cue within the same attribute $b_1$, and $\xi_{j,2}$ is noise coming from interference between attribute $b_1$ and all other attributes $b_\ell$ contained in the structure. Since components of $a_1$ appear an odd number of times in each term of the sums in $\xi_{j,1}$ and $\xi_{j,2}$, we conclude that $\langle \xi_{j,1} \rangle = \langle \xi_{j,2} \rangle = 0$; likewise, $\langle \varepsilon \rangle = 0$. Using Eqns. 5 and 6, the second moments and correlations of the noise terms reduce to sums of products of Kronecker deltas over the summed indices; the cross-correlation of the two noise terms vanishes, and the total noise variance in each component of the estimator is of order $\frac{L}{N}$. We now calculate the first two moments of the overlap of the estimator with the correct pattern, $a_d \cdot \hat{a}_1$, to obtain the SNR for HRR defined in Eqn. 20 of Methods. We find $\langle a_1 \cdot \hat{a}_1 \rangle = 1$ and $\langle a_d \cdot \hat{a}_1 \rangle = 0$ for $d \neq 1$, so the estimator is unbiased; the second moment again follows from Eqn. 6 as a sum of products of Kronecker deltas, giving an overlap variance of order $\frac{L}{N}$. To summarize, for $L \ll N$, the SNR for overlaps defined in Eqn. 20 of the main text is then approximately
$$\mathrm{SNR} \approx \frac{N}{L}. \quad (21)$$

Binarization and Empirical SNR
When decoding from a binarized structure $\sigma = \mathrm{sgn}(S)$, the estimator $\hat{a}_1$ is given by the same HRR unbinding operation applied to $\sigma$. To compute the SNR, we start by evaluating the overlap between the estimator $\hat{a}_1$ and items in the dictionary, $a_d \cdot \hat{a}_1$. While individual products of elements $a_{\ell,i}\, b_{\ell,j}$ are not Gaussian distributed, the overlap $a_d \cdot \hat{a}_1$ is a sum of $N^2$ random variables with correlations only occurring at higher order, as the expectation of the product of different elements of each circular convolution is zero. Since all terms contained in the sums over indices $j$ and $k$ in Eqn. 24 are only correlated at higher order, we can approximate each term in the sum as independent. We then define three random variables, of which $x_1$ is a signal term and $\xi$ is a noise term within the sign function; $\langle z_1 \rangle = \langle z_2 \rangle = \langle \xi \rangle = 0$, and their variances follow from Eqns. 5 and 6.

For $d = 1$, $z_1 = z_2 = z$, and for $L \gg 1$ we can treat $z$ and $\xi$ as Gaussian. In this approximation, the mean overlap $\langle a_1 \cdot \hat{a}_1 \rangle$ is proportional to $\sqrt{\frac{2}{\pi L}}$ (Eqn. 32). Likewise, the second moment follows from the same Gaussian approximation (Eqn. 33). Putting Eqns. 32 and 33 together, the SNR of overlaps when decoding $a_1$ from the binarized structure $\sigma$ is given by
$$\mathrm{SNR} \approx \frac{2}{\pi}\,\frac{N}{L}. \quad (34)$$
Comparing with Eqn. 21, we see that binarizing $S$ has decreased the SNR by a constant factor of $\frac{2}{\pi}$.
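The $\frac{2}{\pi}$ factor can be checked by direct simulation; the sketch below (illustrative parameters, Monte Carlo estimate of the SNR of overlaps) compares decoding from $S$ and from $\mathrm{sgn}(S)$:

```python
import numpy as np

def cconv(a, b):
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(a, b):
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

def empirical_snr(N=512, L=20, trials=200, binarize=False, seed=0):
    rng = np.random.default_rng(seed)
    signal, noise = [], []
    for _ in range(trials):
        A, B = rng.normal(0, 1 / np.sqrt(N), size=(2, L, N))
        S = np.sum([cconv(A[l], B[l]) for l in range(L)], axis=0)
        if binarize:
            S = np.sign(S)
        a_hat = ccorr(B[0], S)
        signal.append(A[0] @ a_hat)
        # overlap with a random dictionary item not contained in S
        noise.append(rng.normal(0, 1 / np.sqrt(N), N) @ a_hat)
    return np.mean(signal) ** 2 / np.var(noise)

snr_raw = empirical_snr(binarize=False)
snr_bin = empirical_snr(binarize=True)
print(snr_raw, snr_bin, snr_bin / snr_raw)   # ratio should be near 2/pi
```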

Memory Initialization
We now determine the average initial overlap $m_0$ between a binarized retrieval structure $\sigma^0$, containing $L_0$ of the $L$ relations in the binarized full structure $\sigma$, in the case where the $a$'s and $b$'s are random vectors with components drawn iid as before. The average overlap $m_0$ between $\sigma^0$ and $\sigma$ can be written as
$$m_0 = \frac{1}{N}\sum_{j=1}^{N} \left\langle \mathrm{sgn}\!\left(S_j^0\right)\, \mathrm{sgn}\!\left(S_j\right) \right\rangle. \quad (35)$$
The overlap between components of the unbinarized structures $S_j$ and $S_j^0$ with HRR binding is given by $\langle S_j\, S_j^0 \rangle = \frac{L_0}{N}$. For large $N$ we can approximate the sum in Eqn. 35 by treating the individual terms as independent, since there are no lower-order correlations between terms. We can then define the random variables $z_1 = S_j^0$ and $z_2 = S_j - S_j^0$ as approximately Gaussian distributed with zero mean and variances $\Sigma_1^2 = \frac{L_0}{N}$ and $\Sigma_2^2 = \frac{L - L_0}{N}$, respectively. Then, Eqn. 35 can be approximated as
$$m_0 \approx \frac{2}{\pi}\arcsin\!\left(\sqrt{\frac{L_0}{L}}\right). \quad (37)$$
For $L_0 \ll L$, Eqn. 37 can be further approximated as
$$m_0 \approx \frac{2}{\pi}\sqrt{\frac{L_0}{L}}. \quad (38)$$
This implies that creating retrieval structures with $L_0$ out of $L$ relations will result in retrieval cues with average overlap $m_0$ with the full memorized structure. The variance of the distribution of initial overlaps vanishes as $N \to \infty$.
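A small numerical check of Eqn. 37 under the same assumptions (illustrative parameters):

```python
import numpy as np

def cconv(a, b):
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

N, L, trials = 2048, 32, 50
rng = np.random.default_rng(1)
for L0 in (2, 8, 16, 32):
    m0 = 0.0
    for _ in range(trials):
        A, B = rng.normal(0, 1 / np.sqrt(N), size=(2, L, N))
        terms = [cconv(A[l], B[l]) for l in range(L)]
        sigma = np.sign(np.sum(terms, axis=0))          # full structure
        sigma0 = np.sign(np.sum(terms[:L0], axis=0))    # cue with L0 relations
        m0 += np.mean(sigma * sigma0) / trials
    pred = (2 / np.pi) * np.arcsin(np.sqrt(L0 / L))
    print(L0, round(m0, 3), round(pred, 3))
```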
We use this relation to determine $l_c(\alpha)$ as a function of $L_0/L$, using the value of $m_{\min}(\alpha)$ obtained for a Pseudo-inverse model with random binary patterns as memories$^2$. Results for $m_{\min}(\alpha)$ are shown in Fig. 1a, demonstrating a linear increase in the range $\alpha < 0.5$. Substituting these results into the relation between $m_0$ and $L_0/L$ closely predicts the value of $l_c(\alpha)$, as shown in Fig. 1b.

Decoding Error
We derive a good approximation for the ML decoding error in terms of $D$, the size of the decoding dictionary, and the SNR, the signal-to-noise ratio of overlaps defined in Eqn. 20 of Methods. The overlap $O_{1d} = a_d \cdot \hat{a}_1$ between a dictionary item $a_d$ and the estimator $\hat{a}_1$ for $a_1$ decoded from a structure $S$ is indexed, as in the previous section, so that $d \leq L$ corresponds to overlaps with patterns within $S$ and $L < d \leq D$ corresponds to overlaps with items in the dictionary not contained in $S$. For ML decoding, an error occurs if $\max(O_{12}, \ldots, O_{1D}) \geq O_{11}$. The probability of error $P_\varepsilon$ is then given by
$$P_\varepsilon = \Pr\!\left[\max\left(O_{12}, \ldots, O_{1D}\right) \geq O_{11}\right].$$
For large $N$, each $O_{1d}$ essentially behaves as an independent Gaussian random variable with mean $\mu_{1d}$ and variance $\Sigma_{1d}^2$. For unbiased decoding schemes, $\mu_{1L} = \mu_{1D} = 0$. In general, we can consider decoding from a structure containing errors by setting $\mu_{11} = \mu$, where $\mu$ is related to the overlap of the corrupted structure with the correct structure. Then $P_\varepsilon$ can be very well approximated as
$$P_\varepsilon \approx 1 - \int Dz \left[1 - H\!\left(\frac{\mu + \Sigma_{11}\, z}{\Sigma_{1D}}\right)\right]^{D-1}, \quad (42)$$
where $Dz = \frac{dz}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}$ and $H(x) = \int_x^\infty Dz$.

For HRR, $\Sigma_{11} \approx \Sigma_{1L} \approx \Sigma_{1D}$ to leading order in $\frac{1}{N}$. Defining $\Sigma$ to be the leading-order term in the variance and identifying $\frac{\mu^2}{\Sigma^2}$ with the SNR of overlaps, Eqn. 42 becomes
$$P_\varepsilon \approx 1 - \int Dz \left[1 - H\!\left(z + \sqrt{\mathrm{SNR}}\right)\right]^{D-1}, \quad (43)$$
as given in the main text.
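Eqn. 43 can be evaluated numerically; a minimal sketch (assuming SciPy, and using $1 - H(x) = \Phi(x)$):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def p_error(snr, D):
    # Eqn. 43: P_eps = 1 - \int Dz [1 - H(z + sqrt(SNR))]^(D-1),
    # with 1 - H(x) = Phi(x) = norm.cdf(x).
    integrand = lambda z: norm.pdf(z) * norm.cdf(z + np.sqrt(snr)) ** (D - 1)
    return 1.0 - quad(integrand, -10, 10)[0]

for D in (100, 10_000):
    print(D, [p_error(snr, D) for snr in (10.0, 25.0, 50.0)])
```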

Limitations for Good Decoding
We can understand the dependence of $P_\varepsilon$ on the SNR and $D$ in the large-SNR regime by obtaining a saddle-point approximation for the expression for $P_\varepsilon$ given in Eqn. 43. We start by rewriting Eqn. 43 as
$$P_\varepsilon = \int Dz \left(1 - e^{-f(z)}\right), \qquad f(z) = -(D-1)\ln\!\left[1 - H\!\left(z + \sqrt{\mathrm{SNR}}\right)\right],$$
and we are interested in the regime where $1 \ll z + \sqrt{\mathrm{SNR}}$. In this regime the integral in $P_\varepsilon$ can be approximated by its saddle-point value at $z_0$, given by the stationarity condition of the exponent, and we can use the asymptotic form
$$H(x) \approx \frac{e^{-\frac{x^2}{2}}}{x\sqrt{2\pi}}, \qquad x \gg 1,$$
which yields, to leading order,
$$P_\varepsilon \approx D\, H\!\left(\sqrt{\mathrm{SNR}}\right) \approx \frac{D}{\sqrt{2\pi\,\mathrm{SNR}}}\, e^{-\frac{\mathrm{SNR}}{2}}.$$
We can insert the SNR of overlaps into this expression to determine the limits on the size of $D$ for good decoding. For HRR, $\mathrm{SNR} \approx \frac{N}{L}$, so requiring an error of at most $\delta \ll 1$ gives
$$D \lesssim \delta\, \sqrt{\frac{2\pi N}{L}}\; e^{\frac{N}{2L}}.$$
From this, we expect the error to be small as long as the SNR is large and $D$ is polynomial, and not exponential, in $N$.
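A quick sanity check of the leading-order result against the full integral (reusing p_error from the previous sketch; parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

# Compare P_eps ~ D * H(sqrt(SNR)) with the exact integral (p_error above).
D = 10_000
for snr in (16.0, 25.0, 36.0):
    approx = D * norm.sf(np.sqrt(snr))   # H(x) = norm.sf(x)
    print(snr, approx, p_error(snr, D))
```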

Decoding After Memory Retrieval
To analyze the effect of memory retrieval on the decoding error, we start by considering decoding from a degraded binarized structure with overlap $m$ with the uncorrupted binarized structure (i.e., at Hamming distance $N\frac{1-m}{2}$). In this case, the effective SNR is found by making the replacement $a_1 \cdot \hat{a}_1 \to m\,(a_1 \cdot \hat{a}_1)$ in Eqn. 34. From this, we see that $\mathrm{SNR}(m)$ should take the form
$$\mathrm{SNR}(m) = m^2\, \mathrm{SNR}.$$
When $\hat{a}_1$ is decoded from an imperfectly retrieved structure from memory, we can instead make the replacement $a_1 \cdot \hat{a}_1 \to \sqrt{c}\, m\,(a_1 \cdot \hat{a}_1)$. Here, $c \sim O(1)$ is a constant factor accounting for differences in the overlap of the structure with relations within and outside the retrieval cue. Then $\mathrm{SNR}(m)$ takes the modified form
$$\mathrm{SNR}(m) = c\, m^2\, \mathrm{SNR}.$$
The decoding error after retrieval from memory is then given by Eqn. 43 with $\mathrm{SNR} \to \mathrm{SNR}(m)$.
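Composing the two results gives a one-line post-retrieval error estimate (with $c$ treated here as an assumed constant; reuses p_error from above):

```python
def p_error_after_retrieval(m, snr, D, c=1.0):
    # Eqn. 43 evaluated at the effective SNR(m) = c * m^2 * SNR,
    # with c an assumed O(1) constant; reuses p_error() from above.
    return p_error(c * m**2 * snr, D)
```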

Insights from Random Patterns

Network Order Parameters
For generic Hopfield networks, we can characterize the quality of memory retrieval by formally defining three network order parameters, which quantify the overlap of the network state with the stored memories at each time step. The first is
$$m^\mu(t) = \frac{1}{N}\sum_{i=1}^{N} \sigma_i^\mu\, \sigma_i(t),$$
which represents the overlap of the network state $\sigma(t)$ with the corresponding pattern $\sigma^\mu$ at time $t$. If the initial state of the network has an $O(1)$ overlap with only a small number of patterns, i.e., $m_0 = m^\mu(0)$, the memory retrieval process can be sufficiently described by including an additional order parameter $r(t)$, which represents the overlap of the state with all patterns except for $\mu$. We can also define
$$m_0(t) = \frac{1}{N}\sum_{i=1}^{N} \sigma_i(0)\, \sigma_i(t)$$
as the overlap of the state at time $t$ with the initial state of the network after attempting to retrieve $\sigma^\mu$, where the overlap between the initial state and $\sigma^\mu$ is $m_0$. While $m_0(t)$ is typically not considered for random patterns, it becomes relevant in the case of structured knowledge, where the retrieval cue can be constructed from a subset of relations rather than simply a corrupted version of a memory. For large $N$, the distributions of the order parameters over fixed retrieval conditions, i.e., $p(m)$, $p(r)$ and $p(m_0)$, are sharply peaked at their average values, denoted by $\langle m \rangle$, $\langle r \rangle$, and $\langle m_0 \rangle$. In the limit $N \to \infty$, these quantities depend solely on $m_0$ and $\alpha$.
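A minimal sketch of these order parameters (the normalization of $r$ below is our assumption, since its precise definition is not reproduced here):

```python
import numpy as np

def order_parameters(patterns, state, state0, mu):
    # patterns: (P, N) array of stored memories; state, state0: (N,) arrays.
    N = patterns.shape[1]
    overlaps = patterns @ state / N
    m = overlaps[mu]                             # overlap with cued pattern
    r = np.sum(np.delete(overlaps, mu) ** 2)     # interference (assumed norm.)
    m0_t = state0 @ state / N                    # overlap with initial state
    return m, r, m0_t
```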

Overlap Distributions
As discussed in Methods, the empirical distribution $p(m)$ is bimodal and takes the general form
$$p(m) = p_1\, \delta(m - 1) + (1 - p_1)\, p_{m<1}(m),$$
where $p_1$ is the probability that a structure is perfectly retrieved from memory and $p_{m<1}(m)$ corresponds to the distribution of $m$ for imperfectly retrieved memories. The shape of the distribution is characterized by $p_1$, the width of the lower-$m$ mode, $\sigma_m$, and the mean of that mode, $m^*$. These quantities all depend on the initial overlap $m_0$ used to retrieve the memory. Results for $p_1$, $m^*$ and $\sigma_m$ are shown in Fig. 2a for several values of $N$ and two values of $\alpha$.
In Fig. 3, we also compare the overlap distributions for structures of length $L$ with retrieval cues of length $L_0 = 2$ against the overlap distribution for a Pseudo-inverse network with random patterns at the corresponding values of $m_0$ given by Eqn. 35.

Retrieval Dynamics
When retrieving structures from memory (in the absence of noise), the update equation for the state of each neuron at time $t$ is given by
$$\sigma_i(t+1) = \mathrm{sgn}\!\left(\sum_j J_{ij}\, \sigma_j(t)\right). \quad (64)$$
For all of the simulations in the main text, we consider parallel updates, in which all of the neurons $1 \leq i \leq N$ are updated simultaneously. We find that serial updates give qualitatively similar results to parallel updates. In general, for large $N$ we find, outside the basin of attraction,
$$\langle m \rangle = f(\alpha)\, m_0, \quad (65)$$
where $\langle m \rangle \to m^*$ as $N \to \infty$. We find that serial updates obey the form in Eqn. 65 with a slightly smaller value of the coefficient $f(\alpha)$. Additionally, for serial dynamics, we can consider the robustness of our results under the addition of noise in the updates. To do this, we use the Metropolis algorithm to obtain the final equilibrium state, where the amplitude of the noise is controlled by the inverse temperature $\beta$. At each update, the acceptance probability is given by
$$P_{\mathrm{accept}} = \min\!\left(1,\; e^{-\beta\, \Delta E}\right), \quad (66)$$
where $\Delta E$ is the change in energy induced by flipping the updated neuron. Eqn. 66 becomes equivalent to Eqn. 64 in the noiseless case $\beta = \infty$. In Fig. 4, we show the average overlap $\langle m \rangle$ of the network state with the cued memory as a function of time for several different initial overlaps $m_0$, retrieved through parallel dynamics and through serial dynamics for several different values of $\beta$. In Fig. 5 we show $\langle m \rangle$, $\langle r \rangle$, and $\langle m_0 \rangle$ as functions of the retrieval-cue overlap $m_0$. We see that the highest overlap is achieved for parallel updates and for serial updates done in random order (alternative orders are discussed in 2). However, the results are qualitatively similar for the different dynamics shown, demonstrating that small amounts of noise have little effect.
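Both dynamics can be sketched compactly (assuming $J_{ii} = 0$ and the energy $E = -\frac{1}{2}\sigma^\top J \sigma$ for the Metropolis step):

```python
import numpy as np

def retrieve_parallel(J, sigma, steps=50):
    # Eqn. 64 applied to all neurons simultaneously.
    for _ in range(steps):
        sigma = np.sign(J @ sigma)
    return sigma

def retrieve_metropolis(J, sigma, beta=10.0, sweeps=50, seed=0):
    # Serial updates in random order with Metropolis acceptance (Eqn. 66);
    # assumes J_ii = 0 and energy E = -1/2 sigma^T J sigma.
    rng = np.random.default_rng(seed)
    sigma = sigma.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(sigma)):
            dE = 2 * sigma[i] * (J[i] @ sigma)   # energy cost of flipping i
            if dE < 0 or rng.random() < np.exp(-beta * dE):
                sigma[i] = -sigma[i]
    return sigma
```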
In Fig. 5a and c, we see that outside of the basin of attraction of the cued memory, i.e., for values of $m_0$ where $\langle m \rangle < 1$, the final network state retains some memory of the initial state, reflected by $\langle m_0 \rangle > m_0$.

Comparing Learning Rules
In the main text, we considered storing structured memories in recurrent neural networks whose synaptic weights are set via the Pseudo-inverse rule. This learning rule fully de-correlates linearly independent memories, so that each memory is a perfect fixed point for $\alpha < 1$. However, the Pseudo-inverse rule is both non-local and non-incremental, which makes a straightforward biological implementation unlikely.
We now consider two learning rules that are both local and incremental: the Hebb rule and the Storkey rule proposed in 3,4 (see Eqns. (26)-(28) in Methods). The Storkey rule can be viewed as a biologically plausible approximation to the Pseudo-inverse rule for small $\alpha$ 4. As a result, the storage capacity of this rule ($\alpha_c \approx 0.4$) is significantly higher than that of the Hebb rule ($\alpha_c \approx 0.14$), and the basins of attraction are larger and more even across different memories.
The storage and retrieval of random memories in Hopfield networks near saturation, i.e., when the number of memories $P$ scales linearly with $N$, is limited by interference between different memories 5. The overlaps between different memories are characterized by the matrix
$$C_{\mu\nu} = \frac{1}{N}\sum_{i=1}^{N} \sigma_i^\mu\, \sigma_i^\nu.$$
Note that $C_{\mu\nu}$ is a $P \times P$ matrix that has the form of a sample covariance matrix.

We can compare the Pseudo-inverse, Hebb, and Storkey rules by looking at the forms of the local fields, given by
$$h_i = \sum_j J_{ij}\, \sigma_j.$$
From Eqn. 64, we see that a memory is a fixed point if $\sigma_i^\mu h_i > 0$ for all $i = 1, \ldots, N$. The elimination of the self-coupling term $J_{ii}$ greatly increases the basins of attraction of the patterns; in the limit $N, P \to \infty$ this can be accomplished by replacing $J_{ij}$ with $J_{ij} - \alpha\delta_{ij}$ 2. The interference between memories is contained in the noise term in the local field.

We start by considering the Hebb rule, with synaptic weight matrix given by
$$J^H_{ij} = \frac{1}{N}\sum_{\mu=1}^{P} \sigma_i^\mu\, \sigma_j^\mu.$$
The local field for the Hebb rule can be expressed as
$$h^H_i = \sum_\mu \sigma_i^\mu\, m^\mu(t). \quad (70)$$
Next we consider the Pseudo-inverse rule, with synaptic weight matrix expressed in terms of $C_{\mu\nu}$ as
$$J^P_{ij} = \frac{1}{N}\sum_{\mu\nu} \sigma_i^\mu \left(C^{-1}\right)_{\mu\nu} \sigma_j^\nu. \quad (71)$$
It is useful to decompose the state of the network $\sigma_i(t)$ into two parts as follows:
$$\sigma_i(t) = \sum_\mu a_\mu(t)\, \sigma_i^\mu + \delta\sigma_i(t),$$
where $\delta\sigma_i(t)$ is orthogonal to all of the patterns ($\sum_i \sigma_i^\mu\, \delta\sigma_i(t) = 0$) and $a_\mu$ is related to the order parameter $m^\mu(t)$ via
$$m^\mu(t) = \sum_\nu C_{\mu\nu}\, a_\nu(t).$$
The local field for the Pseudo-inverse rule can then be expressed as
$$h^P_i = \sum_\mu a_\mu(t)\, \sigma_i^\mu. \quad (74)$$
We can see how Eqn. 74 suppresses the effects of overlaps by considering the state $\sigma_i = \sigma_i^1$. Then $a_\mu = \delta_{\mu 1}$ regardless of any correlations between the memories, and, after removal of the self-coupling term, $h^P_i = (1 - \alpha)\,\sigma_i^1$. This implies that each memory is an eigenvector of $J^P_{ij}$ with eigenvalue $(1 - \alpha)$, so that all memories are perfect fixed points for $\alpha < 1$. By contrast, we see from Eqn. 70 that $h^H_i$ contains additional noise terms of $O\!\left(\frac{1}{\sqrt{N}}\right)$ because of overlaps with the other patterns. This noise reduces the capacity of the network as well as the size of the basins of attraction of each memory, and both are further reduced if the patterns are not random and contain correlations.

A simplified form of the Storkey rule, discussed in 4,6, is given by the incremental update
$$J^{(\mu)}_{ij} = J^{(\mu-1)}_{ij} + \frac{1}{N}\left(\sigma_i^\mu\, \sigma_j^\mu - \sigma_i^\mu\, h^\mu_j - h^\mu_i\, \sigma_j^\mu\right),$$
where $h^\mu_i = \sum_k J^{(\mu-1)}_{ik}\, \sigma^\mu_k$ is the local field produced by the previously stored memories. Following the analysis in 6, we can find a more compact approximation for $J^S_{ij}$: starting with $P = 1$ and iterating to general $P$, keeping terms up to $O\!\left(\frac{1}{N^2}\right)$, $J^S_{ij}$ can be expressed as
$$J^S_{ij} \approx \frac{1}{N}\sum_{\mu\nu} \sigma_i^\mu \left(2\delta_{\mu\nu} - C_{\mu\nu}\right) \sigma_j^\nu. \quad (80)$$
We can relate $J^S_{ij}$ to the Pseudo-inverse rule given in Eqn. 71 by rewriting $J^P_{ij}$ using the following series expansion of $C^{-1}$:
$$C^{-1} = \left[I + (C - I)\right]^{-1} = \sum_{n=0}^{\infty} \left(I - C\right)^n. \quad (81)$$
Plugging this expansion back into Eqn. 71 and keeping the first two terms gives
$$J^P_{ij} \approx \frac{1}{N}\sum_{\mu\nu} \sigma_i^\mu \left(2\delta_{\mu\nu} - C_{\mu\nu}\right) \sigma_j^\nu,$$
which is identical to the expression for $J^S_{ij}$ in Eqn. 80 up to terms of $O\!\left(\frac{1}{N^2}\right)$. The expansion of $C^{-1}$ in Eqn. 81 converges if the eigenvalues of $C$ are all contained in the interval $(0, 2)$. For random memories with components $\sigma_i^\mu$ drawn iid from $\mathcal{N}(0, 1)$, the distribution of the eigenvalues of $C_{\mu\nu}$ for $P, N \to \infty$ is given by the Marchenko-Pastur distribution,
$$\rho(\lambda) = \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi\alpha\lambda}, \qquad \lambda_\pm = \left(1 \pm \sqrt{\alpha}\right)^2.$$
Since $\lambda \in [\lambda_-, \lambda_+]$, the expansion of $C^{-1}$ is valid for $\lambda_+ < 2$, i.e., for $\alpha < (\sqrt{2} - 1)^2 \approx 0.17$, implying that $J^S_{ij} \approx J^P_{ij}$ for sufficiently small $\alpha$ and large $N$. In Fig. 6 we compare $\langle m \rangle$ as a function of $m_0$ for two values of $\alpha$ for the three learning rules, and in Fig. 7 we show $m_{\min}$ (as defined in the main text) as a function of $\alpha$.
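The three weight matrices can be sketched compactly for a pattern matrix of shape $(P, N)$; note that the Storkey form below is the $O(1/N^2)$ approximation of Eqn. 80, not the full incremental rule:

```python
import numpy as np

def hebb(patterns):
    P, N = patterns.shape
    return patterns.T @ patterns / N

def pseudo_inverse(patterns):
    P, N = patterns.shape
    C = patterns @ patterns.T / N                  # overlap matrix C_{mu nu}
    return patterns.T @ np.linalg.inv(C) @ patterns / N

def storkey_approx(patterns):
    # O(1/N^2) approximation of Eqn. 80: J ~ (1/N) sigma^T (2I - C) sigma.
    P, N = patterns.shape
    C = patterns @ patterns.T / N
    return patterns.T @ (2 * np.eye(P) - C) @ patterns / N
```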

Connection to Storage of Knowledge Structures
While the various learning rules and dynamics have quantitative effects on the values of the various network order parameters at fixed memory load and network size, their qualitative behavior as a function of the retrieval cue overlap $m_0$ remains roughly the same for small values of $\alpha$ below capacity.
In general, we have found that outside of the basin of attraction, $\langle m \rangle$ (and $m^*$) scales linearly with $m_0$ for all three learning rules and takes the form in Eqn. 65, where the coefficient $f(\alpha)$ depends on the learning rule.

In all cases, for large $N$ where $\langle m \rangle \to m^*$, the SNR behaves as
$$\mathrm{SNR}(m^*) \propto \left(f(\alpha)\, m_0\right)^2 \frac{N}{L}. \quad (85)$$
As a result, while the choice of learning rule and retrieval dynamics modify the value of $f(\alpha)$ defined in Eqn. 65, they do not change the scaling behavior in Eqn. 85. We compare $P_\varepsilon$ for structures stored using the Hebb, Storkey, and Pseudo-inverse rules in Fig. 8a. In general, we find that for low values of $\alpha \lesssim 0.3$, the error of decoding from memories stored via the Storkey rule is very similar to that of the Pseudo-inverse rule, and both are significantly lower than the error obtained when decoding from memories stored using the Hebb rule. This suggests that the Storkey learning rule sufficiently decorrelates the different structures to allow for both efficient storage and retrieval of structured knowledge.

Temporal Sequences as Sequences of Attractors
Previously, in 7,8, it was shown that a temporal sequence of memories can be stored in a Hopfield network by adding a second, asymmetric synaptic interaction of the form
$$J^{\mathrm{asym}}_{ij} = \frac{\lambda}{N}\sum_{\mu=1}^{P-1} \sigma_i^{\mu+1}\, \sigma_j^{\mu},$$
which drives transitions from each pattern $\sigma^\mu$ to its successor $\sigma^{\mu+1}$.
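A minimal sketch of this construction (the asymmetric term follows the standard formulation; the delayed/slow-synapse dynamics used in 7,8 to traverse the sequence are omitted):

```python
import numpy as np

def sequence_weights(patterns, lam=1.0):
    # Symmetric (Hebb) term plus the asymmetric term
    # J^asym_{ij} = (lambda/N) sum_mu sigma_i^{mu+1} sigma_j^mu,
    # which drives the transition sigma^mu -> sigma^{mu+1}.
    P, N = patterns.shape
    J_sym = patterns.T @ patterns / N
    J_asym = lam * patterns[1:].T @ patterns[:-1] / N
    return J_sym, J_asym
```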